Power Generation Prediction

We're provided with historical power and weather data and are expected to predict the power generated over the next month; weather forecasts for that period are also available.

Table of Contents:

  1. Libraries + Dependencies.
  2. Data Exploration and Sanity Checks (Understanding the datasets).
  3. EDA and Time Series Analysis:
    • Univariate Non-graphical EDA.
    • Univariate Graphical EDA.
    • Bivariate Graphical EDA.
    • Multivariate Graphical EDA.
  4. Preprocessing:
    • Null values.
    • Outliers.
    • Encoding Categorical Columns.
    • Standardisation/Normalisation.
    • PCA.
  5. Model Training, Evaluation and Prediction.

Libraries and Dependencies.

1. Data Exploration and Sanity Checks (Understanding the dataset)

power_actual

weather - actual

weather - forecast

2. EDA and Time Series Analysis.

As mentioned earlier, we generate our train set by merging the power and actual-weather datasets. Some data is lost in the process because the power data is recorded at 15-minute intervals while the actual weather data is recorded at 1-hour intervals. Regardless, we merge the two on the weather data's datetime and perform no aggregation, since power recorded at a particular time is instantaneous.
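The merge above can be sketched as follows. This is a minimal illustration on synthetic data; the column names (`datetime`, `power`, `temperature`) and the inner join are assumptions, not the project's exact schema:

```python
import pandas as pd
import numpy as np

# Hypothetical construction: 15-minute power readings and hourly weather records.
power = pd.DataFrame({
    "datetime": pd.date_range("2019-07-01", periods=8, freq="15min"),
    "power": np.arange(8, dtype=float),
})
weather = pd.DataFrame({
    "datetime": pd.date_range("2019-07-01", periods=2, freq="h"),
    "temperature": [25.0, 26.0],
})

# Inner join on the weather timestamps: only power readings taken exactly on
# the hour survive, so three out of every four 15-minute rows are dropped.
train = power.merge(weather, on="datetime", how="inner")
print(train)
```

Because the join keeps only on-the-hour rows, roughly 75% of the power readings fall away — the data loss mentioned above.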

Univariate non-graphical

This involves calculating measures of spread and central tendency, among other summaries, without using graphical presentations.

The dataset has numeric (float, int), categorical (object) and timestamp (datetime) features.

Some features like precip_type, visibility, ... have missing values.

Distribution plots in the next section will confirm these observations.
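A minimal sketch of the non-graphical summaries used here, run on a synthetic stand-in for the merged train set (the column names and distributions are assumptions for illustration):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the merged train set.
df = pd.DataFrame({
    "power": rng.exponential(scale=50, size=500),   # right-skewed, like solar output
    "temperature": rng.normal(25, 5, size=500),
    "humidity": rng.uniform(0.4, 1.0, size=500),
})

summary = df.describe()    # count, mean, std, quartiles
skewness = df.skew()       # asymmetry of each distribution
kurt = df.kurtosis()       # tail heaviness (excess kurtosis)
missing = df.isna().sum()  # null counts per column

print(skewness.round(2))
print(kurt.round(2))
```

Skewness and kurtosis computed this way are what the distribution plots in the next section confirm visually.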

Univariate Graphical

These plots align with the observations from skewness and kurtosis.

Bivariate Graphical

Exploring relationships between 2 variables using plots.

There's no particular trend nor seasonality.

Records for March–August 2018 are missing.

Power generation peaks in the month of July 2019.

Power generation is higher in 2019 than in 2018, likely because of the July peak.

Power production is higher during the weekends than on weekdays.

Power generation is lowest on Thursday; from there it increases until Monday, flattens, then starts decreasing from Wednesday.
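The monthly and day-of-week breakdowns behind these observations can be reproduced with simple time-index group-bys. The series below is synthetic, so only the mechanics carry over, not the numbers:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly power series indexed by timestamp.
idx = pd.date_range("2019-07-01", "2019-07-14 23:00", freq="h")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="power")

monthly_mean = s.groupby(s.index.to_period("M")).mean()  # month-level trend
dow_mean = s.groupby(s.index.day_name()).mean()          # weekday vs weekend
print(dow_mean.round(1))
```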

Most of the summaries are associated with very low to zero power generation, apart from: humid and partly cloudy, humid and mostly cloudy, humid and overcast, possible light rain and humid, and humid. Humidity is the constant in all these cases, so it must greatly affect power generation.

Just like the summaries, most icons are associated with very low to zero power generation, apart from: clear days, partly cloudy days, and cloudy and rainy days. Cloud cover is the constant in this case.

Also observe that rain appears as well: a rainy day indicates a humid day, and a cloudy day indicates a humid day (cloudy day ~ not so sunny ~ low rates of evaporation ~ atmospheric moisture retention). Thus humidity and cloud cover play important roles in power generation.

We'll see this in the feature importances during modeling.

Most of the summaries show uniformly high humidity levels of around 0.9, except for a few associated with low humidity, e.g. when it's clear, partly cloudy, mostly cloudy or overcast.

Multivariate Graphical

Exploring relationships between more than 2 variables using plots.

Interestingly, none of the weather features is strongly correlated with power.

But some features are highly correlated with each other, e.g. temperature and apparent_temperature, wind speed and wind gust, precipitation intensity and precipitation probability.
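A sketch of how such a correlation check looks. The data is synthetic and the column names are assumptions; the point is that a near-duplicate pair (like temperature vs apparent_temperature) shows up near 1.0 while an unrelated target stays near 0:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
temperature = rng.normal(25, 5, 300)
df = pd.DataFrame({
    "temperature": temperature,
    "apparent_temperature": temperature + rng.normal(0, 0.5, 300),  # near-duplicate
    "power": rng.exponential(50, 300),                              # unrelated
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
```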

Scatter plots of all possible bivariate combinations: some plots indicate linear relationships between variables, others indicate variables with only two possible values, and some indicate randomness (no particular relationship).

The 2019 boxplot behaviour can be explained by the July peak, which certainly looks like an 'outlier'.

3. Preprocessing

Before the modeling phase, we have to preprocess both our train and test sets. For uniform preprocessing, we concatenate the train features (the weather_actual dataset) with the test features (the weather_forecast dataset).

Null Values

We handle null values before modeling so that we end up with robust models.
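One plausible sketch of null handling for this kind of data — interpolation for slowly varying numeric weather readings and a mode fill for categoricals. The column names and fill strategy are assumptions for illustration, not necessarily the project's exact choices:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "visibility": [10.0, np.nan, 8.0, np.nan, 6.0],
    "precip_type": ["rain", None, "rain", "snow", None],
})

# Numeric gaps: linear interpolation suits slowly varying weather readings.
df["visibility"] = df["visibility"].interpolate()
# Categorical gaps: fall back to the most frequent category.
df["precip_type"] = df["precip_type"].fillna(df["precip_type"].mode()[0])

print(df)
```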

Outliers

We use the train set to observe the outliers, then apply the desired changes to the combined dataset.

We have outliers in the dew_point column in both the train and test sets, so we can't simply delete the entries in the test set.

We'll drop it after reconstructing our train and test sets.
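Since test rows can't be deleted, one common alternative is to clip to IQR fences fitted on the train set only. A hedged sketch on synthetic `dew_point` data — the 1.5×IQR rule and clipping are an assumed stand-in, not necessarily the treatment the notebook applies:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(2)
# Synthetic dew_point columns with two planted outliers in train.
train = pd.DataFrame({"dew_point": np.append(rng.normal(15, 2, 98), [40.0, -20.0])})
test = pd.DataFrame({"dew_point": rng.normal(15, 2, 50)})

# Fit IQR fences on the train set only, then clip both sets with the same
# bounds (clipping, rather than row deletion, keeps every test row scorable).
q1, q3 = train["dew_point"].quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
train["dew_point"] = train["dew_point"].clip(lo, hi)
test["dew_point"] = test["dew_point"].clip(lo, hi)
```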

Encoding Categorical Columns
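For categorical columns like summary and icon, one-hot encoding is the usual move. A minimal sketch with `pd.get_dummies`; the example values are made up:

```python
import pandas as pd

df = pd.DataFrame({"summary": ["Clear", "Humid and Overcast", "Clear"],
                   "humidity": [0.5, 0.9, 0.4]})

# get_dummies expands each category into its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["summary"], prefix="summary")
print(encoded.columns.tolist())
```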

Standardization/Normalisation

Heavily skewed to the right, so we apply a log transformation.

Not perfect, but definitely a better representation.
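A sketch of the log transformation on a synthetic right-skewed series. `np.log1p` is assumed here because it tolerates exact zeros, which are common in power and precipitation columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
skewed = pd.Series(rng.exponential(scale=10, size=1000))

# log1p computes log(1 + x), so zero values stay finite.
transformed = np.log1p(skewed)

# Skewness drops markedly after the transform.
print(round(skewed.skew(), 2), round(transformed.skew(), 2))
```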

Dimensionality Reduction (PCA)

We're doing this to take care of the multicollinearity problem, and to drop the one-hot-encoded columns with predominant 0s.

99% of the variation is explained by around 8–10 components.
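The variance threshold can be passed to scikit-learn's PCA directly as a float. A sketch on synthetic collinear features (the latent-factor construction is an assumption made so the reduction has something to find):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 20 columns built from 8 latent factors, mimicking collinear weather features.
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 20))
X = latent @ mixing + rng.normal(scale=0.01, size=(500, 20))

# A float n_components keeps just enough components for that variance share.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], round(pca.explained_variance_ratio_.sum(), 4))
```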

4. Model Training, Evaluation and Predictions.

A very good performance from a very simple model. Now let's use XGBoost and LightGBM with K-fold cross-validation and observe the change in performance.